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Abstract 

With current technologies, it seems to be very difficult to implement quantum computers with many 
qubits. It is therefore of importance to simulate quantum algorithms and circuits on the existing computers. 
However, for a large-size problem, the simulation often requires more computational power than is available 
from sequential processing. Therefore, the simulation methods using parallel processing are required. 

We have developed a general-purpose simulator for quantum computing on the parallel computer (Sun, 
Enterprise4500) . It can deal with up-to 30 qubits. We have performed Shor's factorization and Grover's 
database search by using the simulator, and we analyzed robustness of the corresponding quantum circuits 
in the presence of decoherence and operational errors. The corresponding results, statistics and analyses are 
presented. 

key words : quantum computer simulator, Shor's factorization, Grover's database search, parallel process- 
ing, decoherence and operational errors 

1 Introduction 

With the current technologies, it seems to be very difficult to implement quantum computers with many qubits. 
It is therefore of importance to simulate quantum algorithms and circuits on the existing computers. The 
purpose of the simulation is 

• to investigate quantum algorithms behavior. 

• to analyze performance and robustness of quantum circuits in the presence of decoherence and operational 
errors. 

However, simulations often require more computational power than is usually available on sequential computers. 
Therefore, we have developed the simulation method for parallel computers. That is, we have developed a 
general-purpose simulator for quantum algorithms and circuits on the parallel computer. Symmetric Multi- 
Processor. 




P: Processor, M: Memory 
Figure 1: SMP (Symmetric Multi-Processors). 
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2 Basic Design 



2.1 Registers 

The simulation is for quantum circuit model of computation. A collection of n qubits is called a register of size 
n. The general qubit state of the n-qubit register is 

2"-l 2"-l 

|(/)) = ^ ai\i) where oii G C , ^ la^p = 1 . 

j=0 1=0 

That is, the state of an n-qubit register is represented by a unit-length complex vector on 7^2"- In a classical 
computer, to store a complex number a = x -\- iy^ one require to store a pair of real numbers {x,y). Each 
real number will be represented by a double precision word. The double precision word is 16 bytes (64bits) on 
most of the systems. 2"+^ bytes memory is therefore required to deal with the state of an n-qubit register in a 
classical computer. 



2.2 Evolution 

The time evolution of an n-qubit register is determined by a unitary operator on 7^2" • The size of the matrix 
is 2" X 2". In general, it requires 2" x 2" space and 2"(2"+^ — 1) arithmetic operations to perform classically 
such an evolution step. 

However, we mostly use operators that have simple structures when we design quantum circuits. That is, 
an evolution step is performed by applying a unitary operator (2 x 2) to a single qubit (a single qubit gate) 
or by applying the controlled unitary operator such as a C-NOT gate. It requires only 2x2 space and 3 • 2" 
arithmetic operations to simulate such an evolution step. 



2.2.1 A Single Qubit Gate 
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Figure 2: Single qubit gate. 
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Suppose that the MSB (most significant bit) is 0-th qubit. When a unitary 
matrix U — ( 

operation applied to the n-qubit register state has the form X = ((S'l^i ^) 
2" X 2" matrix X is the sparse regular matrix shown in 

Figure H. 



is applied to the i-th qubit, the overall unitary 
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(0 < fc < 2^) 



Figure 3: Total unitary matrix. 



We therefore do not have to generate X explicitly. We have only to store the 2x2 matrix U . Since there 
are only 2 non-zero elements for each row in A, the evolution step (i.e., multiply of a matrix and a vector) is 
simulated in 3 • 2" arithmetical operations. 

Parallelization 

Of course, the evolution step (A|(/))) can be executed in parallel. Let 2^ be the number of processors available in 
the simulation system. The evolution step is decomposed into a sequence of submatrix-subvector multiplication 
Mfe (0 < fc < 2*). Mk is defined as Sk(t>k, that is, the multiplication of a submatrix Sk (2"^* x 2"~*) and a 
subvector 0^ whose length is 2"~* (shown in Figure ^). Note that there are no data-dependencies between Mk 
and Ml {k ^ I). Therefore, and Mi are executed in parallel. We assign Mp2i-p , Mp2i-p^i, . . . , M(p^_i)2>-^"-i 
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to a processor p {0 < p < 2^). That is, the processor p computes 2*~^ submatrix-subvector multiphcations, and 
the rests of multiphcations are performed in other processors in parallel. After each processor has finished its 
assigned computations, it executes a synchronization primitive, such as the barrier, to make its modifications 
to the vector ((/>), that is, the state of the register visible to other processors. 
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(0 < fc < 20 



Figure 4: Computation decomposition in the general case. 



When the number of submatrices is smaller than the number of processors (i.e., 2* < 2^), it is inefficient 
to assign the computation Mfc(= S'/j(/)fc, < fc < 2^) to one processor as described above. It can cause a 
load imbalance in the simulation system. In this case, we should decompose the computation Mk itself to 
improve parallel efficiency. Each submatrix Sk is divided into 2^+^ chunks of rows. Each chunk of rows Rj 
(0 < j < 2-^"'"^) contains the contiguous 2"~'~(^+i) rows of Sk- The multiplications using the chunk of rows Rj 
and R2P+j £^re assigned to a processor j as described in the Figure ^. This decomposition is applied to all the 
Mfe computations (0 < fc < 2^. 




Parallel Simulator 



Figure 5: Computation decomposition in the large subblock case. 



Note that the computation using j-th row of the submatrix must be always paired with that using (j + 
2"~'~^)-th row when we use an "in-place" algorithm (i.e.. The results of X\(t)) are stored in \(f))). That is, 
multiplications using the chunk of rows Rj and R2P+j are assigned to the same processor j. This is because 
there are dependencies across processors. Consider the following example. 
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If the 1-st element is computed and the result {xun + yui2) is stored before the 4-th element is computed, 
the result of 4-th element computation becomes not XU21 + y"22 but {xun + y"i2)"2i + y"22- This is wrong. 
To avoid this situation, all the processors have only to execute barrier operations before storing the computed 
results. However, a barrier operation per store operation can cause heavy overheads. 

Therefore, the 1-st element computation and 4-th element computation should be assigned to the same 
processor. Then, the data-dependencies are not cross-processor but in-processor. First, the processor computes 
a;"ii + y"i2 and stores the result in a temporary variable ti on the local storage-area (i.e., stack). Second, 
the processor itself computes the result XU21 -\- y"22 and stores it in the 4-th element. Third, the processor 
stores the contents of the temporary variable ti in the 1-st element. In this way, we can avoid the above wrong 
situation without performing synchronization primitives. If there are no overheads for parallel execution, the 
time complexity is thus reduced to 0(2"~^) where 2^ is the number of processors available in the system. 



2.2.2 A Controlled Qubit Gate 

Suppose that a unitary matrix U — ( ) is applied to the «-th qubit 

I \ "21 "22 / 

c — '- f if and only if the c-th bit (controlled bit) is 1. Let CTX be the over all un itary 
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matrix (2" x 2"). First, we consider the matrix X mentioned in Sec. 2.2.1 as if 
there were no controlled bits. Then, for each j (0 < j < 2" — 1), the j-th row of 
CTX {CTX\j]) is defined as follows. 



Figure 6: Controlled qubit CTX[j\ ^ \ ^"Jj l !^ \ 

, ^ \ I\l\ the c-th bit m 7 IS 

gate. us J 

where / is the unit matrix. In this case, we also do not have to generate CTX or X explicitly. We have 
only to store the 2x2 matrix U . In many controlled bit cases, it is easy to extend this method. The evolution 



step is executed in parallel as described in Sec 2.2.1 . Therefore, the simulation time is 0(2"~^) when there are 
no overheads for parallel execution (2^ is the number of processors available in the simulation system.) 

The simulator provides a f-controlled U gate. It is similar to the controlled U gate. The U gate is applied 
to the target bit iff /(c) = 1 (the c-th bit is the controlled bit). It is used in the Grover's Search Algorithm 



2.2.3 Measurement Gates 

The measurement step for an n-qubit register state is simulated in 0(2") time as follows. Let |0) = Ylj=o^ 
be an n-qubit register state. 

1. Generate a random number r {0 < r < I) 

2. Determine an integer i (0 < * < 2" — 1), s.t. 

j=o J=0 



We consider that the measurement is done with respect to the standard basis 
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2.3 Basic Circuits 
2.3.1 Hadamard Transform 
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The Hadamard transform i7„ is defined as follows, 



n-1 7/ 



for a; G {0, 1}". Hn is implemented by the circuit in Figure where H denotes 
—j= ^1 ) ' ^'''^^^^^^ '^^''^'^'^'^^''^^^^^^'^^^^^^ 

for parallel execution (2^ is the number of processors available in the simulation 
Figure 7: Hadamard cir- system.), 
cuit. 

2.3.2 Quantum Fourier Transform 

The quantum Fourier transform (QFT) is a unitary operation that essentially performs the DFT on quantum 
register states. The QFT maps a quantum state |</)) = X]^=o^ oix\x) to the state Y^x=o^ lix\x), where 
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The circuit implementing the QFT is described in the Figure g. H is the Hadamard gate, and Rd is the 
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phase shift gate denoted as 
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The bit-order reverse 
Figure 8: The QFT2„ chcuit (n = 4). 



For general n, this circuit has O(n^) siz^ Therefore, the evolution step is simulated in 0{n?2"~^) time 
when there are no overheads for parallel execution (There are 2^ processors available in the system) . Of course, 
we can reduce the circuit size to 0{n log(n/e)) |^ if we settle the implementation of fixed accuracy (e), because 
the controlled phase shift gates acting on distantly separated qubits contribute only exponentially small phases. 
In this case, the evolution step is simulated in 0(nlog(n/e)2"^^) when there are no overheads for parallel 
execution. 

If we regard the QFT transform as a black box operator (that is, if we suppose that this QFT circuit has 
no error), we do not have to use this quantum circuit in the simulator to perform QFT transformation. We 
can use fast Fourier transform (FFT) in the simulator instead of the QFT circuit. The FFT algorithm requires 
only 0(ri2"~^) steps when there are no overheads for parallel execution. Of course, the FFT gives the exact 
solution. We use the 8-radix in-place FFT algorithm. 



2.3.3 Arithmetical circuits 

The arithmetical circuits are important for quantum computing pX[ |. In the Shor's factoring algorithm|^, 
the arithmetical circuits to compute modular exponentiation are used. Therefore, according to Ref we 

*There is a quantum circuit that computes QFT (modulo 2") that has the size 0(n{logn)^ log log n) [pj 
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have implemented the modular exponentiation circuit by using constant adders, constant modular adders and 
constant multipliers. a;°(mod N) can be computed using the decomposition, 



X 



,a 




2 . . . oo (binary representation)) 



Thus, modular exponentiation is just a chain of products where each factor is either 1 (a^ = 0) or (a^ = 1). 
Therefore, the circuit is constructed by the pairwise controlled constant multipliers^. 

Let TV be an n bit number, and a a 2n bit number (that is, I is equal to 2n in the above equation.) in the 
Shor's factoring algorithm because a is as large as N"^ . ti + 1 qubits are required as the work-space for the 
controlled multiplier and n + 4 for the controlled adders. The total number of required qubits becomes 5n + 6. 

The circuit is constructed with the 0{l) (that is, 0{n)) pairwise controlled constant multipliers. The 
controlled constant multiplier consists of 0{n) controlled constant modular adders. The controlled constant 
modular adder consists of 5 controlled constant adders. The controlled constant adder consists of 0(n) XOR 
(C-NOT) gates. Thus, the modular exponentiation circuit requires O(n^) gate. Detailed are described in Ref 
M. It is simulated in 0(n^2"~^) when there are no overheads for parallel execution (2^ is the number of 
processors available in the simulation system). 



3.1 Decoherence 

We consider the quantum depolarizing channel as the decoherence error model. In this channel, with probability 
1 — p, each qubit is left alone. In addition, there are equal probabilities p/3 that Gx, cfy, or Gz affects the qubit. 

3.2 Operational Error 

In general, all of single qubit gates are generated from rotations 



For example, we consider _ff„ as t/ft( j)L''pi(7r), and NOT gate as U ii{^)U px{ti) . The simulator represents 
inaccuracies by adding small deviations to the angles of rotation 9 and 4>. Each error angle is drawn from 
Gaussian distribution with the standard deviation (a). 

4 Preliminary Experiment 

We describe the simulation environment and some experiments about basic quantum circuits. 

4.1 Simulation Environment 

We have developed the simulator on the parallel computer. Sun Enterprise 4500 (E4500). The E4500 has 8 
UltraSPARCTI processors (400MHz) with 1MB E-cache and 10GB memory. The system clock is lOOMHz. The 

05 is Solaris 2.8 (64bit OS). The simulator is written in a C language and the compiler that we use is Forte 
Compiler 6.0. The compiler option "-x05 -fast -xtarget=ultra2 -xarch=v9". We use the Solaris thread 
library for multi- processor execution. Under this environment, if we use an in-place algorithm, 30- qubit quantum 
register states can be simulated. 

4.2 Quantum Fourier Transform 

Table || shows the QFT execution time by the simulator using the QFT-circuit and (classical) FFT algorithm. 
The numerical error value is ranged from 10^^^ to 10^^*^. Recall that 2^ be the number of processors available 

tOf course, we must classically compute the numbers (modAf) 
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Table 1: QFT execution time (sec). 
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in the simulation system. The FFT algorithm requires 0(n2"~^) steps and the QFT circuit requires 0(n^2"~^) 
steps for the n-qubit quantum register, if there are no overheads for parallel execution. The execution time is 
increased in exponential order in proportional to n. The execution time of the FFT is about 20 ~ 30 times as 
fast as that of the circuit. Both the execution time are decreased when the number of processors are increased. 
The speedup-ratios on 8-processor execution are about 4 ~ 5. The reason why the speedup-ratios on 8-processor 
execution are not 8 is that the parallel execution has some overheads that single processor execution does not 
have. The parallel execution overheads are operating system overheads (multi-threads creation, synchronization, 
and so on), load imbalance, memory-bus saturation, memory-bank conflict, false sharing and so on. For small- 
size problems, the ratio of overheads to the computation for parallel execution is relatively large and speedup- 
ratios on multi-processor execution may be less than 4. The decoherence and operational errors experiment for 
the QFT is described in Section ||. 

4.3 Hadamard Transform 



Table 2: HT execution time (sec). 
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Table Q shows the Hadamard Transform (HT) execution time by using the circuit. The HT circuit requires 
0(n2"^^) steps for the rt-qubit quantum register. The speedup-ratio on 8-processor execution becomes about 
5. 

4.3.1 Effect of Errors 

We have investigated the decrease of the |0)(0| term in the density matrix for the 20-qubit register. 
Decoherence Errors 

We have analyzed decoherence in the HT circuit on the depolarizing channel. Of course, the simulation deals 
with pure states. Therefore, the experiments were repeated 10000 times and we use the average values. Each 
experiment uses different initial random seed. The start state of the quantum register is |00 . . .0) = |0). The 
HT circuit is applied to the quantum register over and over. The x-axis in the Figure ^ shows the even iteration 
number. If there are no errors (i.e., the error probability is 0) and the number of iteration is even, the state 
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Figure 9: Decrease of the |0)(0| term in the density matrix (20 qubits). 

remains |0) and |0) (0| term in the density matrix remains 1. Figure || shows how decoherence errors degrade the 
|0)(0| term. The noise degrades the |0)(0| term significantly if the error probabihty is greater than 10~'^. When 
the error probabihty is 10^^, the |0)(0| term is decreased in exponential order in proportional to the number of 
iterations. 

In this easy case, we can compute |0)(0| term in the density matrix theoretically. First, consider the 1 qubit 
case. Let p be the error probability and pk be the density matrix after the HT circuit is applied to the quantum 
register k times. The density matrix pk+i is calculated as follows. 

Pfe+i - {\-p)HpuH* + ^a,HpkH*a,* + ^ayHpkH*ay* + ^a,HpkH*a,*. 
When the start state of the quantum register is |0) and k is even, pk is calculated as follows, 



1 f l + (l-|p)'= 



^'^ 2 V 1 - (1 - Ip)'' 

In the n— qubit case, we can calculate the density matrix similarly when the start state of the quantum register 
is |0, . . . , 0) and k is even. |0)(0| term of pk is 

^ 2 ' ' 

Figure ^ also shows this theoretical value of |0)(0| term in the density matrix when p — 10~^ ~ 10~^ and 
n = 20. We can see that the simulations and the theoretically computations yield almost the same result. 
Operational Errors 

The simulator represents inaccuracies by adding small deviations to the two angles of rotations. Since 
H — UR{j)Upi{n), we add small deviations x and y to j and n respectively. That is, we use H{x,y) = 
Ur{j +x)Upi{t: + y) as the H gate in this experiment, x and y are drawn from Gaussian distribution with the 
standard deviation (a). As mentioned above, the experiments are executed 10000 times and we use the average 
value. Each experiment uses different initial random seed. Figure |l^ shows how operational errors degrade the 
|0)(0| term when a ~ 10^^ ^ 10^^ and n = 20. The |0)(0| term is not affected by the operational error if a is 
less than 10~^. 

In this case, we can also compute |0)(0| term in the density matrix theoretically. First, consider the 1 qubit 
case. Let pk be the density matrix after the HT circuit is applied to the quantum register k times. The density 
matrix pk+i is calculated as follows. 

/•OO f-OO 

Pk+i= / H{x,y)pkH{x,y)*p{x)p{y)dxdy 
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Figure 10: Decrease of the |0)(0| term in the density matrix (20 qubits). 



where p{z) — ^^^ e ^ . When the start state of the quantum register is |0 . . . 0) = |0), pk is calculated as 
follows, 




As for the general n— qubit case, we can calculate the density matrix similarly when the start state of the 
quantum register is |0 . . . , 0) and k is even. |0) (0| term of pk is 

^ 2 ' ' 

Figure || also shows this theoretical value of |0)(0| term in the density matrix when the standard deviation 
a = 10~^ ~ 10~^ and n = 20. It follows from the theoretical computation that |0)(0| term is decreased in 
exponential order in proportional to the number of iterations k. 
Both Operational and Decoherence Errors 



Table 3: Combined effects for HT. 
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Each element of Table ^ represents the |0)(0| term of the density matrix after the HT is applied to the state 
|0) of a 20-qubit register 10000 times. The combined effect of two factors may be worse than each factor alone, 
that is to say, the effect seems to be the product of each factor. Table shows this situation. 



5 Experiment 
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5.1 Shor's Factorization Algorithm ^ 

First, we review the algorithm briefly. 

Input An I bit odd number n that has at least two distinct prime factors. 
Output A nontrivial factor of n 

1. Choose an arbitrary x £ {1, 2, . . . , — 1} 

2. (Classical) Compute d — gcd{x,n) by using Euclid's algorithm. If d > 1, output d and stop. 

3. (Quantum) Try to find the order of x: 

(a) Initialize an Z-qubit register and a 2Z-qubit register to state |0)|0). 

(b) Apply the HT to the second register. 

(c) Perform the modular exponentiation operator. 
That is, |0)|a) |a;''(mod n))\a) 

(d) Measure the first register and apply the QFT to the second register and measure it. Let y be the result. 

4. (Classical) Find relatively prime integers k and r (0 < fc < r < n), s.t. |^ — 7I < ^(21+1) by using the 
continued fraction algorithm. If x^ ^ l(mod7i) or r is odd or x^^^ = ±l(mod7i), output "failure" and stop. 

5. (Classical) Compute d± = gcd(n,a::2 ± 1) by using Euclid's algorithm. Output numbers d± and stop. 

When the simulator performs all the step-3 operations (not only the QFT but also the m odular exponenti- 
ation) on the quantum circuit, 51 + 6 qubits are totally required, as described in the Section 2.3.3. Therefore, 
the simulator can only deal with 4-bit integer n {51 + 6 <= 30 ^ I < 4). The 4-bit integer that satisfies the 
input property is only 15. We have tried to factor 15 on the simulator. Beyond our expectation, the modular 
exponentiation is computationally much heavier than the QFT. 



Table 4: Execution time in the Shor's factorization algorithm when n = 15 and a; = 11 (All the quantum 
operations are executed on the circuit). 



Modular exponentiation 


QFT 


18184 (sec) 


0.64270 (sec) 



The modular exponentiation requires 0(Z^2'^^) steps and the QFT on the circuit requires 0(^2'^-^) steps 
when there are 2^ processors available in the simulation system and there are no overheads for parallel execu- 
tion. Of course, in the classical computer, modular exponentiation consists of basic operations such as addition, 
multiplication and division. However, these basic operations are not so heavy in the classical computer, because 
it has the dedicated non-reversible circuit (the so-called ALU :arithmetic logic unit). This situation suggests 
that a brand-new fast quantum algorithm for arithmetic operations are required. 15 is not enough to investigate 
the behavior of Shor's factoring algorithm. To factor much larger number in a reasonable time, the simulator 
performs the step-3(c) and the step-3(d) classically. That is, the modular exponentiation are computed classi- 
cally and the QFT is computed by the FFT algorithm in the simulator. In this case, the simulator does not 
need to generate the first register. Therefore, the simulator can factor about 14 ^ 15-bit integers (for example, 
23089). 

The factoring algorithm successes with the probability greater than 

Probsucc("-) = Pstep2 + (1 -Pstep2)Pstep3~4 

(1 _ i(!!l) + (I 4 e-^ 

^ n-l' 71-1 4 TT^loglogn'^ 

where Pstep2 means the probability that the stcp-2 successes and Pstep3~4 means the probability that step-3 and 
the step-4 success and 7 is the Euler constant (l>{n) is the Euler number of n. If the above algorithm is repeated 
0(l/Probsucc("-)) times, the success probability can be as close to 1 as desired. 

We choose an n = pq where p and q are prime numbers. This kinds of integers are chosen in an RSA 
cryptosystem because it is believed that it is hard to factor such integers easily. 4>{n) = {p — l){q — 1) for such 
integers. We have experimented with several RSA-type 14 ~ 15-bit integers. 

The simulator repeats the above algorithm until a nontrivial factor of n is found. The simulator records the 
number of iterations. The experiment is executed 100 times and we use the average of these recorded iterations. 
We compare the simulation values with the theoretical number of needed iterations (i.e.,l/Probsucc(f^))- The 
results are shown in the Table |[ Theoretical values (Theoretical) are about only 2^4 times as large as 
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Table 5: Number of needed iterations of Shor's factoring algorithm. 



n 


Num. of Iterations 


Theoretical 


Simulation 


Original 


Improved 


21311(= 


211 


101) 


15.79 


6.690 


1.760 


21733(= 


211 


103) 


15.85 


8.990 


2.356 


22999(= 


211 


109) 


16.00 


6.360 


1.730 


22523(= 


223 


101) 


15.88 


5.480 


1.770 


22927(= 


227 


101) 


15.91 


3.790 


1.470 


22969(= 


223 


103) 


15.94 


8.050 


2.070 


23129(= 


229 


101) 


15.92 


7.133 


1.636 



simulation values (Original). Although much more simulations are required, the theoretical values seem to be 
fairly good. 

As suggested in Ref the algorithm is optimized so as to perform less quantum computation and more 
(classical) post-processing. 

1. Neighbor y Check 

If we do not find the relatively prime integers k and r by using the continued fraction algorithm, it is wise 
to try y ± 1, y ± 2. 

2. CCD Check 

Even if x'" ^ 1 (mod n), try to compute d± — gcd(n, ± 1). 

3. Small Factor Check 

If x'' ^ l(mod n), it is wise to try 2r, 3r . . .. This is because if ^ ~ ^, where k and r have a common 

factor, this factor is likely to be small. Therefore, the observed value of ^ is rounded off to ^ in the 
lowest terms. 

4. LCM Check 

If two candidates for r, that is ri and r2, have been found, it is wise to test lcm(ri, r2) as a candidate r. 

We have tested how much the algorithm is improved by these modifications. The results are also shown in 
Table ^ (Improved). The number of iterations are reduced to about 1/5 ~ 2/5. The detailed effect of the 
improved algorithm is described in Table M. 



Table 6: Detailed effect of improved algorithm 



n 


Ratio of Success/Failure 


1 (Neighbor) 


2(GCD) 


3(SF) 


4(LCM) 


21311 


27/9 


52/19 


12/4 


3/4 


23129 


27/9 


52/19 


12/4 


3/4 


22999 


37/6 


47/79 


13/8 


2/58 


22969 


41/8 


22/82 


31/22 


1/28 


22927 


25/3 


35/49 


18/2 


1/28 


22523 


37/6 


45/76 


18/22 


7/54 



Each element of Table g represents s// where s means the number of success iterations and / means 
the number of failure iterations. For example, about n — 23129, the first optimization, "Neighbor Check" is 
performed for 27 + 9 = 36 iterations and the candidate of the order is found successfully in 27 iterations. It seems 
that the second optimization "GCD Check" works well for all the n that we have experimented with. From this 
result, we can see that even if x"^ ^ l(mod n), d± — gcd(n, x^ ± 1) often become the factor of n. That is, even 
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if the candidate r is not equal to ord{x) (an order of a:), there is the possibiUty that N3 3a>l, a-r = ord{x). 
In this case, the following equation holds when r is even. 

O(mod n) = a;°''''(^) - 1 

= (.t'' - + + . . . + 1) 

Thus, there is the possibility that n and ±1 have a common non-trivial factor. 
5.2 Effect of Errors 

We have analyzed decoherence and operational errors in the QFT circuit. Decoherence Errors 
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Figure 11: Amplitude amplification by QFT in the presence of decoherence error (top) and the required number 
of iterations (bottom) (16 qubits). 



We assume that each qubit is left intact with probability 1 — p and it is affected by each of the error operators 
crx,cTy,az with the same probability | each time the register is applied by the controlled rotation gate Rd- 
Figure |ll] shows the amplitude amplification phase by the QFT circuit on the depolarizing channel in Shor's 
factorization algorithm (Step 3 (d)) when n = 187 and x = 23. The y axe in the Figure p] shows the amplitude. 
The experiment is executed 1000 times and we use the average. If the error probability is greater than 10^'^, it 
is hard to use the QFT circuit for the purpose of period estimation. 
Operational Errors 

The simulator represents inaccuracies by adding small deviations to the angles of rotations of Rd- We 
consider Hn — Ur{j)Upi{tt) , and NOT gate = Uii{^)Upi{tt). The simulator also represents inaccuracies by 
adding small deviations to these angles of rotations. The error is drawn from Gaussian distribution with the 
standard deviation (cr). As mentioned above, the experiment is executed 1000 times and we use the average. 
Figure |l^ shows the amplitude amplification phase by the QFT in the Shor's factorization algorithm (Step 
3(d)) when n — 187 and x = 23. It seems that the period extraction by using the QFT is not affected by the 
operational error. 

Both Operational and Decoherence Errors 

We investigate the combined effect of operational and decoherence errors. Table ^ shows the result. Each 
element of table represents the fidelity. The fidelity is defined as the inner product of the correct state and the 
simulated state with errors. 
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Figure 12: Amplitude amplification by QFT in the presence of operational error (top) and the required number 
of iterations (bottom) (16 qubits). 



The combined effect of two factors may be worse than each factor alone, that is to say, the effect seems 
to be the product of each factor. However, when the decoherence rate is relatively higher, the small-deviation 
operational error can improve the results contrary to our expectations. When the size of register is large, the 
decoherence probability even greater than 10~^ drops the fidelity significantly. 



Table 7: Combined effects for QFT (16bit) 
Operational(cr) 








10-=' 


10-4 


10--^ 


10-^ 







1.0000 


0.9999 


0.9999 


0.9999 


0.9998 


10 


-5 


0.9880 


0.9840 


0.9860 


0.9880 


0.9848 


10 


-4 


0.8837 


0.8897 


0.8827 


0.8801 


0.8980 


10 


-3 


0.3287 


0.3399 


0.3332 


0.3209 


0.3363 


10 


-2 


0.0027 


0.0015 


0.0019 


0.0017 


0.0031 



5.3 Grover's Search Algorithm 

Suppose that a function fk : {0, 1}" — > {0, 1} is an oracle function such that fk{x) = 5xk- The G-iteration is 
denoted as —HnVf^HnVf^. The sign-changing operator Vf is implemented by using the /-controlled NOT gate 
and one ancillary bit. Figure |l^ shows the circuit of Grover's algorithm. 

5.3.1 Effect of Errors 

We have analyzed the impacts of decoherence and operational errors in the circuit of Grover's algorithm. We 
assume depolarizing channel that each qubit is left intact with probability 1—p and it is affected by each of the 
error operators ax,cry,cTz with the same probability | per G-iteration. We consider = UR{j)Upi{Tr) and 
NOT-gate = Ur{^)Upi{tt). The simulator represents inaccuracies by adding small deviations to the angles of 
these rotations. Each error angle is drawn from Gaussian distribution with the standard deviation (ct). 

Figure ^ and |l^ show the impacts of errors for a 10-qubit register. The experiments were repeated 1000 
times and we use the average values. If there are no errors, plotting the amplitude of the correct element (that 
is, k) makes a sine curve. However, the amplitudes are decreased as G-iterations are repeated in the presence 
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Figure 13: The circuit of Grover's algorithms. 
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Figure 14: Decrease of the ampUtude of the correct element in the presence of decoherence errors (10 qubit). 
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Figure 15: Decrease of the amplitude of the correct element in the presence of operational errors (10 qubit). 
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of errors. Figure |IJ shows the impacts of decoherence error. We can see that the decoherence error affects the 
period of the sine-curve. Figure 15 shows the impacts of operational errors. It seems that the operational error 
does not affect the period of the sine-curve. 



6 Related Works 

There are many quantum simulators for quantum circuit model of computation |^. |6; _|ll[ . QDD[Q aims to 
use Binary Decision Diagram in order to represent the states of quantum register. QCL^fand OpenQubit |]ll| 
both use complex number representation of the quantum states like our simulator. In addition, QCL tries to 
establish a high-level, architecture-independent programming language. The Obenland's simulator Q is based 
on an actual physical experimental realization and it uses parallel processing like our simulator. Although it 
runs on the distributed-mcmory multi-computers, our simulator runs on the shared-memory multi-computers. 
Therefore, in our simulator, there is no need to distribute and collect the states of the quantum register. In 
addition, our simulator uses more efficient evolution algorithms and adopts (classical) FFT algorithms for the 
fast simulation of the large-size problems. Our simulator does not depend on any actual physical experimental 
realizations because it is not easy to say which realizations are best at this moment. In other words, our 
simulator is more general-purpose. 



7 Conclusion 

We have developed a parallel simulator for quantum computing on the parallel computer (Sun, Enterprise4500). 
Up-to 30 qubits can it deal with. We have performed Shor's factorization and Grover's database search by using 
the simulator, and we analyzed robustness of the corresponding quantum circuits in the presence of decoherence 
and operational errors. If the decoherence rate is greater than 10"^, it seems to be hard to use the both 
quantum algorithms in practice. For future work, we will investigate the correlation between decoherence and 
operational errors, that is, why small-deviation operational errors can improve the results when the decoherence 
rate is relatively higher. Furthermore, we will try quantum error-correcting code to fight decoherence and 
operational errors. 
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(c) SMP Cluster 



